DACSS 603 HW 1 Kimble

HW 1 for DACSS 603
Author

Karen Kimble

Published

October 3, 2022

Question 1: Lung Capacity

Setup

Code
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
library(readxl)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.5
✔ tibble  3.1.8     ✔ stringr 1.4.1
✔ tidyr   1.2.1     ✔ forcats 0.5.2
✔ readr   2.1.3     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
knitr::opts_chunk$set(echo = TRUE)

# Reading in File
LungCapData <- read_excel("_data/LungCapData.xls")

Part A: Distribution

Code
hist(LungCapData$LungCap)

The histogram above shows that the lung capacity data are roughly normally distributed: most observations cluster around the mean, and the counts taper off toward both tails.
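
A visual check of normality can be made a little more systematic by overlaying a normal curve on the histogram and drawing a Q-Q plot. The sketch below is only an illustration: `lung_cap` is a simulated stand-in for `LungCapData$LungCap`, and its mean and standard deviation are made-up values, not estimates from the real data.

```r
# Simulated stand-in for LungCapData$LungCap (values are illustrative only)
set.seed(603)
lung_cap <- rnorm(500, mean = 8, sd = 2.5)

# Histogram on the density scale so a normal curve can be overlaid
hist(lung_cap, freq = FALSE, main = "Histogram with Normal Curve",
     xlab = "Lung Capacity")
curve(dnorm(x, mean = mean(lung_cap), sd = sd(lung_cap)), add = TRUE)

# Q-Q plot: points lying close to the line suggest approximate normality
qqnorm(lung_cap)
qqline(lung_cap)
```

With the real data, `lung_cap` would simply be replaced by `LungCapData$LungCap`.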

Part B: Probability Distribution of LungCap (Males vs. Females)

Code
boxplot(LungCap ~ Gender, data = LungCapData, main = "Lung Capacity by Gender",
        xlab = "Gender", ylab = "Lung Capacity")

From the boxplots above, males in this study appear to have slightly higher lung capacities than females: the median is about 9 for males and about 8 for females. Both genders show a wide range of values, and the ranges follow the same pattern, with the male distribution shifted slightly higher.
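
Medians read off a boxplot are approximate, so it can help to compute them directly. The sketch below uses a small made-up data frame (`toy`, with hypothetical values) in place of `LungCapData`; only the column names are taken from the homework.

```r
# Toy data standing in for LungCapData (values are made up for illustration)
toy <- data.frame(
  Gender  = c("female", "female", "female", "male", "male", "male"),
  LungCap = c(6.5, 8.0, 9.5, 7.0, 9.0, 11.0)
)

# Median lung capacity within each gender
medians <- tapply(toy$LungCap, toy$Gender, median)
medians

# fivenum() returns min, lower hinge, median, upper hinge, max for each group
tapply(toy$LungCap, toy$Gender, fivenum)
```

Running the same two `tapply()` calls on `LungCapData$LungCap` and `LungCapData$Gender` would give the exact medians behind the plot.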

Part C: Smokers vs. Non-Smokers

Code
smokers <- filter(LungCapData, Smoke == "yes")
mean(smokers$LungCap)
[1] 8.645455
Code
nonsmokers <- filter(LungCapData, Smoke == "no")
mean(nonsmokers$LungCap)
[1] 7.770188

The mean lung capacity for smokers (8.65) is higher than the mean lung capacity for non-smokers (7.77). Based on what we now know about how smoking affects the lungs, this result is counterintuitive. One possibility is that smokers are more used to deep inhales and exhales and therefore have better lung capacity until smoking has had more of an effect on their lungs. Other factors that are not yet visible in this comparison, such as age, may also be driving the result.

Part D: Lung Capacity by Smoker/Non-Smoker and Age

Code
LungCapData <- within(LungCapData, {
  Age.group <- NA
  Age.group[Age <= 13] <- "13 and Under"
  Age.group[Age >= 14 & Age <= 15] <- "14-15"
  Age.group[Age >= 16 & Age <= 17] <- "16-17"
  Age.group[Age >= 18] <- "18 and Over"
} )
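
The same grouping can also be written with `cut()`, which builds a factor from break points in a single call. This is an alternative sketch, not the approach used above; the `age` vector holds illustrative values, and the break points assume whole-number ages.

```r
# Illustrative ages only; in the homework this would be LungCapData$Age
age <- c(3, 13, 14, 15, 16, 17, 18, 19)

# right = TRUE closes each interval on the right:
# (-Inf, 13], (13, 15], (15, 17], (17, Inf)
age_group <- cut(age,
                 breaks = c(-Inf, 13, 15, 17, Inf),
                 labels = c("13 and Under", "14-15", "16-17", "18 and Over"),
                 right = TRUE)

# Number of observations in each age group
table(age_group)
```

One difference worth noting: a fractional age such as 13.5 would be left as `NA` by the `within()` version but would fall into "14-15" here.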

Smokers

Code
# Boxplots

smoking_age <- filter(LungCapData, Smoke == "yes")

boxplot(LungCap ~ Age.group, data = smoking_age,
        main = "Lung Capacity of Smokers by Age Group",
        xlab = "Age Group", ylab = "Lung Capacity")

From the boxplot above, we can see that the maximum lung capacity for smokers stays at about 12 across age groups, with little improvement as age increases. The medians rise more noticeably with age, though not dramatically after ages 14-15. Smokers who are 18 and over have the highest lung capacities overall, but this may simply reflect natural growth and development.

Code
# Means

smoking_age %>%
  group_by(Age.group) %>%
  summarise_at(vars(LungCap), list(name = mean))
# A tibble: 4 × 2
  Age.group     name
  <chr>        <dbl>
1 13 and Under  7.20
2 14-15         8.39
3 16-17         9.38
4 18 and Over  10.5 

We see the same trend in the means as in the medians: mean lung capacity increases as age increases.

Non-Smokers

Code
# Boxplot

nonsmoking_age <- filter(LungCapData, Smoke == "no")

boxplot(LungCap ~ Age.group, data = nonsmoking_age,
        main = "Lung Capacity of Non-Smokers by Age Group",
        xlab = "Age Group", ylab = "Lung Capacity")

In non-smokers, we see the same trend of increasing lung capacities as age increases, but the median lung capacities in the two older age groups in the non-smoking group are higher than those in the smoking group. There are also more outliers for non-smokers, especially in the 14-15 category.

Code
# Means

nonsmoking_age %>%
  group_by(Age.group) %>%
  summarise_at(vars(LungCap), list(name = mean))
# A tibble: 4 × 2
  Age.group     name
  <chr>        <dbl>
1 13 and Under  6.36
2 14-15         9.14
3 16-17        10.5 
4 18 and Over  11.1 

The means for the non-smoking group follow the same trend as the medians, and the same trend as in the smoking group: mean lung capacity increases with age. However, the mean lung capacity in each of the three oldest age groups is higher for non-smokers than for smokers (9.14 vs. 8.39, 10.5 vs. 9.38, and 11.1 vs. 10.5).

Part E: Lung Capacities for Smokers and Non-Smokers within Age Group

13 and Under

Code
LungCapData %>%
  filter(Age.group == "13 and Under") %>%
  group_by(Smoke) %>%
  summarise_at(vars(LungCap), list(name = mean))
# A tibble: 2 × 2
  Smoke  name
  <chr> <dbl>
1 no     6.36
2 yes    7.20

The mean lung capacity for smokers is higher than the mean lung capacity for non-smokers in the 13 and Under age group, which mirrors the overall means we found earlier. However, the boxplot of smokers by age group shows a very low outlier in this age group, which may be pulling down the mean for this group and for smokers overall.

14-15

Code
LungCapData %>%
  filter(Age.group == "14-15") %>%
  group_by(Smoke) %>%
  summarise_at(vars(LungCap), list(name = mean))
# A tibble: 2 × 2
  Smoke  name
  <chr> <dbl>
1 no     9.14
2 yes    8.39

In this age group, the mean lung capacity for non-smokers is higher than the mean lung capacity for smokers, unlike in the younger group.

16-17

Code
LungCapData %>%
  filter(Age.group == "16-17") %>%
  group_by(Smoke) %>%
  summarise_at(vars(LungCap), list(name = mean))
# A tibble: 2 × 2
  Smoke  name
  <chr> <dbl>
1 no    10.5 
2 yes    9.38

The same trend continues in this age group: the mean lung capacity of non-smokers ages 16 and 17 is higher than that of smokers in the same group. From the 14-15 group to the 16-17 group, the means for non-smokers and smokers rise by a similar amount (roughly 1 in each case).

18 and Over

Code
LungCapData %>%
  filter(Age.group == "18 and Over") %>%
  group_by(Smoke) %>%
  summarise_at(vars(LungCap), list(name = mean))
# A tibble: 2 × 2
  Smoke  name
  <chr> <dbl>
1 no     11.1
2 yes    10.5

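In this oldest age group, the same trend continues: the mean lung capacity for non-smokers is higher than that of smokers. The pattern in the 14-15, 16-17, and 18 and Over groups is the reverse of the overall comparison, where smokers had the higher mean. The low outlier among the youngest smokers cannot explain this reversal, since it would lower the smokers' mean rather than raise it. A more plausible explanation is age: if most non-smokers fall in the 13 and Under group, where lung capacity is lowest, their overall mean is pulled down relative to smokers, who would tend to be older.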
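
The gap between the within-group and overall comparisons can be explored by tabulating group means and group sizes side by side. The sketch below uses a small made-up data frame (`toy`, hypothetical values) built so that non-smokers have the higher mean inside each age group while smokers have the higher overall mean. Whether the real data are imbalanced in the same way would need to be checked with `table(LungCapData$Age.group, LungCapData$Smoke)`.

```r
# Made-up data: non-smokers are mostly young, smokers are mostly older
toy <- data.frame(
  Age.group = c(rep("13 and Under", 7), rep("18 and Over", 6)),
  Smoke     = c(rep("no", 6), "yes", rep("no", 2), rep("yes", 4)),
  LungCap   = c(rep(6, 6), 5, rep(11, 2), rep(10, 4))
)

# Mean lung capacity for every Age.group x Smoke combination in one call
group_means <- tapply(toy$LungCap, list(toy$Age.group, toy$Smoke), mean)
group_means

# Overall means by smoking status, ignoring age
overall_means <- tapply(toy$LungCap, toy$Smoke, mean)
overall_means

# Group sizes show how unevenly the two groups are spread across ages
table(toy$Age.group, toy$Smoke)
```

In this toy example the overall comparison flips only because the two groups have different age compositions, which is the kind of effect the age-group breakdown is meant to reveal.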

Part F: Correlation & Covariance

Code
# Correlation

cor(LungCapData$Age, LungCapData$LungCap, use = "everything")
[1] 0.8196749

The correlation between lung capacity and age is positive and strong: the coefficient of about 0.82 is close to 1. As age increases, lung capacity tends to increase as well.

Code
# Covariance

cov(LungCapData$Age, LungCapData$LungCap, use = "everything")
[1] 8.738289

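The covariance is positive, meaning that there is a positive relationship between the variables, which is also clear from the correlation (since the correlation coefficient is a function of the covariance). Unlike the correlation, the size of the covariance depends on the units of the two variables, so it indicates the direction of the relationship but not its strength. Age and lung capacity have an overall positive relationship: as age increases, so does lung capacity.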
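
The link between the two quantities can be checked numerically: the Pearson correlation is the covariance divided by the product of the two standard deviations. The vectors below are illustrative values, not rows from `LungCapData`.

```r
# Illustrative values only; in the homework these would be the Age and LungCap columns
age      <- c(5, 8, 10, 12, 15, 17, 19)
lung_cap <- c(3.1, 5.4, 6.0, 7.9, 9.2, 10.8, 11.5)

# Pearson correlation is the covariance rescaled by both standard deviations
r_manual <- cov(age, lung_cap) / (sd(age) * sd(lung_cap))
r_manual

# Built-in correlation for comparison; the two values agree
cor(age, lung_cap)
```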

Question 2: Prisoner Convictions

Part A

Code
# P(exactly 2 prior convictions): 160 of the 810 prisoners
160/810
[1] 0.1975309

Part B

Code
# P(fewer than 2 prior convictions): prisoners with 1 or 0 priors
(434 + 128)/810
[1] 0.6938272

Part C

Code
# P(2 or fewer prior convictions): prisoners with 2, 1, or 0 priors
(160 + 434 + 128)/810
[1] 0.891358

Part D

Code
# P(more than 2 prior convictions): prisoners with 3 or 4 priors
(64 + 24)/810
[1] 0.108642
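
Parts A through D can also be computed from a single vector of counts, which makes it easier to see which categories each probability adds up. This is a sketch using the counts from the vector in Part E; the names `counts`, `probs`, and the `p_*` variables are chosen here for illustration.

```r
# Number of prisoners by number of prior convictions (0 through 4)
counts <- c("0" = 128, "1" = 434, "2" = 160, "3" = 64, "4" = 24)

# Convert counts to probabilities; the total is 810 prisoners
probs <- counts / sum(counts)

p_exactly_2 <- probs[["2"]]                   # Part A
p_fewer_2   <- sum(probs[c("0", "1")])        # Part B
p_at_most_2 <- sum(probs[c("0", "1", "2")])   # Part C
p_more_2    <- sum(probs[c("3", "4")])        # Part D

c(A = p_exactly_2, B = p_fewer_2, C = p_at_most_2, D = p_more_2)
```

Parts C and D are complements, so their probabilities sum to 1.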

Part E

Code
# Creating vector
convict <- c(rep(0, 128), rep(1, 434), rep(2, 160), rep(3, 64), rep(4, 24))

# Each prisoner appears once in the vector, so the plain mean is the expected value
mean(convict)
[1] 1.28642

The expected value for the number of prior convictions is about 1.29. An expected value is a long-run average, so it does not have to be a whole number even though each prisoner's count is; the most common single value is 1 prior conviction.
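
The same expected value can be obtained from the definition, E[X] = sum of x * P(X = x), without expanding the data into a long vector. The variable names below are chosen for illustration.

```r
# Possible numbers of prior convictions and how many prisoners have each
x      <- 0:4
counts <- c(128, 434, 160, 64, 24)

# Probability of each value
p <- counts / sum(counts)

# Expected value: each value weighted by its probability
expected_value <- sum(x * p)
expected_value

# Equivalent: weighted.mean() with the counts supplied as weights
weighted.mean(x, w = counts)
```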

Part F

Code
var(convict)
[1] 0.8572937
Code
sd(convict)
[1] 0.9259016
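
Note that `var()` and `sd()` use the sample formula with an n - 1 denominator. If the 810 prisoners are treated as the full probability distribution, the variance is instead the probability-weighted average of squared deviations from the mean. The sketch below computes both and shows how they relate; the variable names are chosen for illustration.

```r
# Possible numbers of prior convictions and how many prisoners have each
x      <- 0:4
counts <- c(128, 434, 160, 64, 24)
n      <- sum(counts)
p      <- counts / n

# Mean and variance of the probability distribution (denominator n)
mu      <- sum(x * p)
pop_var <- sum((x - mu)^2 * p)
pop_sd  <- sqrt(pop_var)

# Rescaling by n / (n - 1) recovers what var() reports for the expanded vector
sample_var <- pop_var * n / (n - 1)

c(population = pop_var, sample = sample_var)
```

With n = 810 the two versions differ only in the third decimal place, so the choice has little practical effect here.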